Purpose: to practice the Pandas, Matplotlib and Seaborn libraries on the 'Speed Dating' dataset from Kaggle: https://www.kaggle.com/annavictoria/speed-dating-experiment
We will try to understand which criteria improve the chance of getting a date during a speed dating event.
As the purpose of the exercise is mainly to practice Python data visualization libraries, we will not be exhaustive about all the data we could look at. We will plot relevant graphs and try to find some patterns, but we will not investigate everything or run machine learning algorithms (except for some regressions included in the graphs).
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
speed_dating = pd.read_csv("data/Speed Dating Data.csv", encoding='ISO-8859-1')
speed_dating.head()
| iid | id | gender | idg | condtn | wave | round | position | positin1 | order | ... | attr3_3 | sinc3_3 | intel3_3 | fun3_3 | amb3_3 | attr5_3 | sinc5_3 | intel5_3 | fun5_3 | amb5_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 4 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 3 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 10 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 5 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 7 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 195 columns
In this part we will look at personal information as declared in the study by the attendees themselves. The goal is first to get some statistical clues about the attendees, how they feel about themselves and what they expect from the event, before looking at the matching results.
Let's first look at the information given by participants. As the dataset structure is "one line = one meeting between two attendees", we first extract the personal features for each 'iid':
features_perso = ['iid',
'gender',
'idg',
'age',
'field',
'field_cd',
'race',
'imprace',
'imprelig',
'from',
'zipcode',
'goal',
'date',
'go_out',
'career',
'career_c',
'exphappy',
'expnum',
'attr1_1',
'attr4_1',
'attr2_1',
'attr3_1',
'attr5_1']
# We did not keep all features, in order to simplify the problem.
# The choice of keeping only "attractiveness" instead of "fun", "sharing the same interests", etc.
# was made knowing that all those criteria are linked to each other (see the end of the notebook).
speed_dating[features_perso].head()
| iid | gender | idg | age | field | field_cd | race | imprace | imprelig | from | ... | go_out | career | career_c | exphappy | expnum | attr1_1 | attr4_1 | attr2_1 | attr3_1 | attr5_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 21.0 | Law | 1.0 | 4.0 | 2.0 | 4.0 | Chicago | ... | 1.0 | lawyer | NaN | 3.0 | 2.0 | 15.0 | NaN | 35.0 | 6.0 | NaN |
| 1 | 1 | 0 | 1 | 21.0 | Law | 1.0 | 4.0 | 2.0 | 4.0 | Chicago | ... | 1.0 | lawyer | NaN | 3.0 | 2.0 | 15.0 | NaN | 35.0 | 6.0 | NaN |
| 2 | 1 | 0 | 1 | 21.0 | Law | 1.0 | 4.0 | 2.0 | 4.0 | Chicago | ... | 1.0 | lawyer | NaN | 3.0 | 2.0 | 15.0 | NaN | 35.0 | 6.0 | NaN |
| 3 | 1 | 0 | 1 | 21.0 | Law | 1.0 | 4.0 | 2.0 | 4.0 | Chicago | ... | 1.0 | lawyer | NaN | 3.0 | 2.0 | 15.0 | NaN | 35.0 | 6.0 | NaN |
| 4 | 1 | 0 | 1 | 21.0 | Law | 1.0 | 4.0 | 2.0 | 4.0 | Chicago | ... | 1.0 | lawyer | NaN | 3.0 | 2.0 | 15.0 | NaN | 35.0 | 6.0 | NaN |
5 rows × 23 columns
Now we can group by 'iid', which means we get a dataset with "one line = one attendee", to work more easily on personal information:
# numeric_only=True drops the text columns ('field', 'from', ...) that cannot be averaged
sd_group_iid = speed_dating[features_perso].groupby('iid').mean(numeric_only=True).reset_index()
Attendees were asked whether sharing the same religion was an important criterion for them, rating its importance from 0 to 10.
p = sns.catplot( y = 'imprelig', kind= 'box', data = sd_group_iid)
p.set(title='Importance of common religion for attendees')
p.set_ylabels('Mark from 0 to 10')
plt.show()
Religion does not seem to matter much in this sample, as 75% of attendees gave a mark lower than 6. Let's look at the 'race' feature.
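The "75% below 6" reading of the box plot can be checked numerically with a quantile. A minimal sketch on made-up marks (on the real data one would call `quantile` on `sd_group_iid['imprelig']`):

```python
import pandas as pd

# Hypothetical importance marks standing in for sd_group_iid['imprelig']
marks = pd.Series([0, 1, 2, 3, 3, 4, 5, 5, 6, 8])

# The third quartile is the value below which 75% of attendees fall,
# i.e. the top of the box in the box plot
q3 = marks.quantile(0.75)
print(q3)
```

If the third quartile is below 6, the "75% of attendees" claim holds for that sample.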
p = sns.catplot( y = 'imprace', kind= 'box', data = sd_group_iid)
p.set(title='Importance of common race for attendees')
p.set_ylabels('Mark from 0 to 10')
plt.show()
Same conclusion as for religion. Does sharing the same race have a significant effect on matches?
p = sns.catplot(x = 'samerace', y = 'match', data = speed_dating, kind = 'bar')
p.set_ylabels('Match rate')
p.set_xlabels('Same race ?')
p.set_xticklabels(["No","Yes"])
plt.show()
At first sight, sharing the same race seems to make a match slightly easier, but given the confidence intervals the difference is not really significant.
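The bar heights in this kind of plot are simply group means of the 0/1 'match' column. A minimal sketch on toy rows (column names follow the real dataset, the values are invented):

```python
import pandas as pd

# Toy stand-in for the real speed_dating[['samerace', 'match']] columns
df = pd.DataFrame({
    'samerace': [0, 0, 0, 0, 1, 1, 1, 1],
    'match':    [0, 1, 0, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column per group is the match rate, which is what the bar plot shows
rates = df.groupby('samerace')['match'].mean()
print(rates)
```

Seaborn's error bars come from bootstrapping these group means, which is what lets us judge whether the difference is meaningful.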
"What is your primary goal in participating in this event?"
temp_df = sd_group_iid.groupby('goal').count().reset_index()
goal_labels = {1: 'Seems like a fun night out',
               2: 'To meet new people',
               3: 'To get a date',
               4: 'Looking for a serious relationship',
               5: 'To say I did it',
               6: 'Other'}
temp_df['goal'] = temp_df['goal'].map(goal_labels)
sns.set_theme(font_scale = 1.3)
p = sns.catplot(x = 'goal', y = 'iid', data = temp_df, kind = 'bar', height = 5, aspect = 3)
p.set(title="Attendees' goals for participating in the event")
p.set_xlabels("")
p.set_ylabels("Number of attendees")
plt.show()
Most people said they came mainly for fun and to meet new people, not to find love or even a date.
temp_df = sd_group_iid.groupby('date').count().reset_index()
date_labels = {1: 'Several times a week',
               2: 'Twice a week',
               3: 'Once a week',
               4: 'Twice a month',
               5: 'Once a month',
               6: 'Several times a year',
               7: 'Almost never'}
temp_df['date'] = temp_df['date'].map(date_labels)
plt.pie(x = temp_df.iid, radius = 2, labels = temp_df.date, autopct = '%1.1f%%')
plt.title("Distribution of attendees' answers to the question:\n\"In general, how frequently do you go on dates?\"",
pad = 120,
fontdict = {'fontsize': 20})
plt.show()
temp_df = sd_group_iid.groupby('go_out').count().reset_index()
go_out_labels = {1: 'Several times a week',
                 2: 'Twice a week',
                 3: 'Once a week',
                 4: 'Twice a month',
                 5: 'Once a month',
                 6: 'Several times a year',
                 7: 'Almost never'}
temp_df['go_out'] = temp_df['go_out'].map(go_out_labels)
explode = (0,0,0,0.2,0.3,0.9,0.9)
plt.pie(x = temp_df.iid, radius = 2, labels = temp_df.go_out, autopct = '%1.1f%%', explode = explode)
plt.title("Distribution of attendees' answers to the question:\n\"How often do you go out (not necessarily on dates)?\"",
pad = 120,
fontdict = {'fontsize': 20})
plt.show()
Most attendees go out often.
p = sns.catplot(x = 'exphappy',
y = 'iid',
data = sd_group_iid.groupby('exphappy').count().reset_index(),
kind = 'bar',
height = 5,
aspect = 2,
color = "b")
p.set(title="Distribution of attendees' answers to the question:\n\"Overall, on a scale of 1-10, how happy do you expect to be with the people you meet during the speed-dating event?\"")
p.set_xlabels("Mark (0 to 10)")
p.set_ylabels("Number of attendees")
plt.show()
Most answers fall between 5 and 7: attendees are moderately enthusiastic about the event, but not overly so.
sd_group_iid.groupby('expnum').count().reset_index().head()
| expnum | iid | gender | idg | age | field_cd | race | imprace | imprelig | goal | date | go_out | career_c | exphappy | attr1_1 | attr4_1 | attr2_1 | attr3_1 | attr5_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 0 | 8 | 8 | 0 |
| 1 | 1.0 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 0 | 9 | 9 | 0 |
| 2 | 2.0 | 19 | 19 | 19 | 19 | 18 | 19 | 19 | 19 | 19 | 19 | 19 | 16 | 19 | 19 | 0 | 19 | 19 | 0 |
| 3 | 3.0 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 0 | 17 | 17 | 0 |
| 4 | 4.0 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 0 | 14 | 14 | 0 |
p = sns.catplot(x = 'expnum',
y = 'iid',
data = sd_group_iid.groupby('expnum').count().reset_index(),
kind = 'bar',
height = 5,
aspect = 3,
color = "b")
p.set(title="Distribution of attendees' answers to the question:\n\"Out of the 20 people you will meet, how many do you expect will be interested in dating you?\"")
p.set_xlabels("Number of persons expected to be interested (out of 20 persons)")
p.set_ylabels("Number of attendees")
plt.show()
Most people expect fewer than 6 of the 20 persons to be interested in them. We also see a peak at 10: around 15 attendees expect to appeal to half of the people they will meet!
It could be interesting to check whether confident persons are indeed more likely to appeal to other people. We will plot a graph to look for a trend between the expected number of interested persons ('expnum') and the actual ratio of interested partners ('dec_o'):
p = sns.regplot(x = speed_dating.groupby('expnum').count().reset_index().expnum,
                y = speed_dating.groupby('expnum').mean(numeric_only=True).reset_index().dec_o,
                logistic = True,
                )
p.set(title="Interested persons vs expectation of interested persons")
p.set_xlabel("Number of persons expected to be interested (out of 20 persons)")
p.set_ylabel("Ratio of interested persons")
p.figure.set_size_inches(15, 6)
plt.show()
If we remove the outlier at 'expnum' = 14, the regression fits quite well. Confident people (who expected a higher number of persons to be interested in them) are indeed the ones who get more interest. It is also interesting to note that the slope of the line is smaller than the expectations would suggest:
=> The chance to appeal increases with high expectations, but not as fast as the expectations themselves.
It is also important to keep in mind that the population size behind each point is very small here (particularly for high 'expnum' values), as we saw in the previous figure. For example, 'expnum' = 13 or 14 only counts one person, so the average value is not really representative.
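One way to guard against averages built on one or two observations is to drop 'expnum' groups below a minimum size before averaging. A sketch on toy rows, with a hypothetical threshold of 5 observations:

```python
import pandas as pd

# Toy stand-in for speed_dating[['expnum', 'dec_o']]
df = pd.DataFrame({
    'expnum': [2] * 6 + [3] * 7 + [14] * 1,
    'dec_o':  [1, 0, 1, 0, 0, 1,  1, 1, 0, 1, 0, 1, 1,  1],
})

# Keep only expnum groups with at least 5 observations before averaging
counts = df['expnum'].value_counts()
kept = counts[counts >= 5].index
means = df[df['expnum'].isin(kept)].groupby('expnum')['dec_o'].mean()
print(means)  # the single observation at expnum = 14 is dropped
```

The same filter applied to the real data would remove the outlier points discussed above before fitting the regression.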
We already saw some explanatory information about matches in the first part. Let's keep exploring the data to try to explain how to improve the chances of a match.
First, let's quickly identify which features are correlated with 'match', in order to focus on them:
corr = speed_dating.corr(numeric_only=True)
corr = corr.abs()  # absolute values map correlations to [0, 1], so we can easily sort them later
fig, ax = plt.subplots(figsize=(20,20))
p = sns.heatmap(corr, linewidths=1)
p.set(title="Heatmap : features correlation matrix")
plt.show()
... pretty much impossible to read, as we have 195 columns. Let's focus on matches:
p = sns.catplot(x= 'index',
y= 'match',
data = corr['match'].sort_values(ascending = False).iloc[0:20].reset_index(),
kind = 'bar',
height = 5,
aspect = 2.7,
color = 'b')
p.set(title="Correlation : match rate vs features")
plt.show()
'Match' happens when 'dec' = 1 and 'dec_o' = 1 (reminder: 'dec' equals 0 if the person declines a second date and 1 if they accept; 'dec_o' is the same decision seen from the partner's perspective).
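This definition is easy to sanity-check: a match is the logical AND of the two decisions. A minimal sketch on toy rows (on the real data one would compare the result against the 'match' column):

```python
import pandas as pd

# Toy decisions: a match requires both sides to say yes
df = pd.DataFrame({'dec': [1, 1, 0, 0], 'dec_o': [1, 0, 1, 0]})

# Element-wise AND of the two 0/1 columns reconstructs the match flag
df['match_check'] = (df['dec'] & df['dec_o']).astype(int)
print(df['match_check'].tolist())  # [1, 0, 0, 0]
```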
In order to better understand how the partner makes the decision to keep going, let us run the same investigation on the 'dec_o' column:
p = sns.catplot(x= 'index',
y= 'dec_o',
data = corr['dec_o'].sort_values(ascending = False).iloc[0:20].reset_index(),
kind = 'bar',
height = 5,
aspect = 2.7,
color = 'b')
p.set(title="Correlation : decision of other attendees vs other features")
plt.show()
We mostly see the qualities attr_o, fun_o and shar_o. Let's look at the correlation between dec_o and those features.
p = sns.regplot(x = 'attr_o', y = 'dec_o', data = speed_dating, logistic = True)
p.set(title="Decision rate of other attendees vs attractiveness mark")
p.set_xlabel("Attractiveness mark")
p.set_ylabel("Ratio of decision from other attendee")
p.figure.set_size_inches(15, 6)
plt.show()
The graph above shows that, unsurprisingly, the attractiveness mark an attendee receives is correlated with the number of partners willing to go on a second date. The more attractive an attendee is perceived to be, the better their chances of getting a second date.
p = sns.regplot(x = 'fun_o', y = 'dec_o', data = speed_dating, logistic = True)
p.set(title="Decision of other attendees rate vs fun mark")
p.set_xlabel("Fun mark")
p.set_ylabel("Ratio of decision from other attendee")
p.figure.set_size_inches(15, 6)
plt.show()
p = sns.regplot(x = 'shar_o', y = 'dec_o', data = speed_dating, logistic = True)
p.set(title="Decision of other attendees rate vs 'sharing same interests' mark")
p.set_xlabel("'sharing same interests' mark")
p.set_ylabel("Ratio of decision from other attendee")
p.figure.set_size_inches(15, 6)
plt.show()
Attractiveness, fun and shared interests clearly help in the partner's decision.
We can also check whether there is a correlation between attractiveness and fun:
p = sns.catplot(x= 'index',
y= 'attr_o',
data = corr['attr_o'].sort_values(ascending = False).iloc[0:20].reset_index(),
kind = 'bar',
height = 5,
aspect = 2.7,
color = 'b')
p.set(title="Correlation between attractiveness and other features")
p.figure.set_size_inches(15, 6)
plt.show()
'fun_o' is quite correlated with 'attr_o' too! Let's look deeper at how the grades are related:
p = sns.regplot(x = 'fun_o', y = 'attr_o', data = speed_dating)
p.set(title="Correlation between attractiveness and fun: all points")
p.figure.set_size_inches(15, 6)
plt.show()
The regression line shows the correlation, but the cloud of points makes it difficult to interpret.
Let's try with a new DataFrame where we take the 'fun_o' average for each 'attr_o' grade:
# we look if we have relevant number of values for attractivity grade :
speed_dating["attr_o"].value_counts()
6.0     1655
7.0     1642
5.0     1260
8.0     1230
4.0      748
9.0      540
3.0      390
10.0     324
2.0      244
1.0      108
0.0        8
6.5        7
7.5        3
9.5        3
8.5        1
9.9        1
10.5       1
3.5        1
Name: attr_o, dtype: int64
For this analysis we can keep grades from 1 to 10, and only integer values :
df_interm = speed_dating[['fun_o', 'attr_o']]
mask = df_interm['attr_o'].isin(range(1,11))
df_interm = df_interm[mask]
df_temp = df_interm.groupby('attr_o').mean().reset_index()
p = sns.regplot(x = 'fun_o', y = 'attr_o', data = df_temp)
p.set(title="Correlation between attractiveness and fun: average")
p.figure.set_size_inches(15, 6)
plt.show()
Far better. On average, "fun" and "attractiveness" are clearly correlated.
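That visual impression can be quantified with a Pearson correlation coefficient, which pandas computes directly with `Series.corr`. A sketch on invented average grades (not the real `df_temp` values):

```python
import pandas as pd

# Hypothetical averaged grades standing in for df_temp['fun_o'] and df_temp['attr_o']
fun = pd.Series([2.0, 4.0, 5.0, 7.0, 8.0])
attr = pd.Series([3.0, 4.5, 5.5, 7.5, 8.5])

# Series.corr computes Pearson's r by default; values near 1 mean a strong linear relation
r = fun.corr(attr)
print(round(r, 3))
```

On the real averaged data, an r close to 1 would confirm the near-linear trend seen in the plot.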
We saw in a previous graph (the bar plot of 'attr_o' correlations) that 'attr_o' is also correlated with 'shar_o'.
We can draw an interactive 3D graph with Plotly:
import plotly.express as px
df_temp = speed_dating[['fun_o', 'shar_o', 'attr_o']].groupby('attr_o').mean().reset_index()
fig = px.scatter_3d(df_temp, x='fun_o', y='shar_o', z='attr_o', title = "Attractiveness vs fun vs 'sharing same interests'")
fig.show()
The three features seem strongly correlated with each other. Given the time available for the project, it is difficult to push the analysis further to understand which feature causes the others (or whether other features explain these three together...).
Let us see what happens when we plot the average 'like_o' grade depending on 'fun_o' and 'attr_o':
df_temp = speed_dating[['fun_o', 'attr_o', 'like_o']].groupby(['fun_o', 'attr_o']).mean().reset_index()
fig = px.scatter_3d(df_temp, x='attr_o', y='fun_o', z='like_o', title = "Like VS Attractivity VS Fun")
fig.show()
Like grade increases linearly with attractivity and fun.
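That "linear" impression can be checked with a least-squares fit of the like grade on the attractiveness and fun grades. A sketch with NumPy on invented averaged grades (the coefficients below are chosen to build the toy data, not estimated from the real dataset):

```python
import numpy as np

# Toy averaged grades standing in for df_temp's attr_o, fun_o and like_o columns
attr = np.array([3.0, 5.0, 6.0, 8.0])
fun = np.array([4.0, 5.0, 6.0, 7.0])
like = 0.5 * attr + 0.4 * fun + 1.0  # constructed to be exactly linear here

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([attr, fun, np.ones_like(attr)])
coef, residuals, rank, _ = np.linalg.lstsq(X, like, rcond=None)
print(np.round(coef, 3))  # recovers the [0.5, 0.4, 1.0] used to build the toy data
```

On the real data, small residuals from such a fit would support the visual "like increases linearly" reading.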
Based on this data exploration, it is hard to answer the question of how to improve the chance of a match using objective criteria. A match is, by definition, the coordinated decision of two individuals to see each other again, and this decision is highly correlated with the qualities perceived by the other person, mainly attractiveness, fun and shared interests.
It is interesting to point out that those three perceived qualities seem strongly correlated with each other (at least on average). We could dig deeper to see whether one factor explains the others, or whether good grades in all of these characteristics come from a good overall feeling.
We still have a lot to learn from this dataset!